Templatized Transformation using Databricks

Data transformation is the process of converting, cleansing, and structuring raw data into a usable format that can be analyzed to support decision-making. It typically involves removing duplicates, converting data types, and enriching the dataset, and includes defining the target structure, mapping the data, and extracting the data from the source system.
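The cleansing steps described above (removing duplicates, converting data types) can be sketched in plain Python. This is only an illustration of the concept; the record fields (`id`, `amount`) are hypothetical and not part of the platform.

```python
# Illustrative raw records: duplicates and string-typed numeric fields.
raw = [
    {"id": "1", "amount": "10.5"},
    {"id": "2", "amount": "7.0"},
    {"id": "1", "amount": "10.5"},  # duplicate record
]

# Remove duplicates, keeping the first occurrence of each id.
seen, deduped = set(), []
for rec in raw:
    if rec["id"] not in seen:
        seen.add(rec["id"])
        deduped.append(rec)

# Convert data types: ids to int, amounts to float.
cleaned = [{"id": int(r["id"]), "amount": float(r["amount"])} for r in deduped]
print(cleaned)  # [{'id': 1, 'amount': 10.5}, {'id': 2, 'amount': 7.0}]
```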

Data Pipeline Studio (DPS) provides templates for creating transformation jobs. These jobs support join, union, and aggregate operations that group or combine data for analysis.
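The join, union, and aggregate operations behind these templates can be sketched in SQL. In this minimal sketch, sqlite3 stands in for the actual Databricks warehouse, and the table and column names (`orders`, `customers`, `region`, `amount`) are hypothetical:

```python
import sqlite3

# Hypothetical source tables illustrating the three templatized operations.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10, 120.0), (2, 10, 80.0), (3, 20, 200.0)])
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(10, "EMEA"), (20, "APAC")])

# Join: combine orders with customer attributes.
join_rows = cur.execute(
    "SELECT o.order_id, c.region, o.amount "
    "FROM orders o JOIN customers c ON o.customer_id = c.customer_id"
).fetchall()

# Union: stack two result sets that share a schema (duplicates removed).
union_rows = cur.execute(
    "SELECT customer_id FROM orders UNION SELECT customer_id FROM customers"
).fetchall()

# Aggregate: total amount per region after the join.
agg_rows = cur.execute(
    "SELECT c.region, SUM(o.amount) FROM orders o "
    "JOIN customers c ON o.customer_id = c.customer_id "
    "GROUP BY c.region ORDER BY c.region"
).fetchall()
print(agg_rows)  # [('APAC', 200.0), ('EMEA', 200.0)]
```

A templatized job applies the same pattern, with DPS generating the query from the tables and columns you select in the UI.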

For complex operations on data, DPS provides the option of creating custom transformation jobs. While users write the transformation logic themselves, the DPS UI offers an option to build SQL queries by selecting specific columns of tables. Calibo Accelerate consumes these SQL queries, along with the transformation logic, to generate the code for custom transformation jobs.
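The column-scoped SQL that such a job consumes can be sketched as follows. This is an assumption-laden illustration, not platform code: sqlite3 stands in for the warehouse, and the table (`raw_events`), its columns, and the filter are all hypothetical.

```python
import sqlite3

# Hypothetical source table; names are illustrative, not platform-defined.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events "
             "(event_id INTEGER, event_type TEXT, payload TEXT, created_at TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?, ?)", [
    (1, "click", "{}", "2024-01-01"),
    (2, "view",  "{}", "2024-01-01"),
    (3, "click", "{}", "2024-01-02"),
])

# Selecting specific columns, as the DPS UI does, before the custom logic runs.
selected_columns = ["event_id", "event_type"]
query = (f"SELECT {', '.join(selected_columns)} FROM raw_events "
         "WHERE event_type = 'click' ORDER BY event_id")
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 'click'), (3, 'click')]
```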

To create a Databricks templatized transformation job

  1. Sign in to the Calibo Accelerate platform and navigate to Products.

  2. Select a product and feature. Click the Develop stage of the feature and navigate to Data Pipeline Studio.

  3. Create a pipeline with the following nodes:

    Data Lake (Amazon S3) > Data Transformation (Databricks) > Data Lake (Amazon S3)

    In the data transformation pipeline that you create, you can either add two data lake nodes (source and target) or use a single data lake node and connect the data transformation node to and from it.

  4. Click the Databricks node and click Create Templatized Job.

Complete the following steps to create the job:

 

What's next? Databricks Custom Transformation Job